Goto

Collaborating Authors

 McKenzie County


Minimum Wasserstein distance estimator under covariate shift: closed-form, super-efficiency and irregularity

Lang, Junjun, Zhang, Qiong, Liu, Yukun

arXiv.org Machine Learning

Covariate shift arises when covariate distributions differ between source and target populations while the conditional distribution of the response remains invariant, and it underlies problems in missing data and causal inference. We propose a minimum Wasserstein distance estimation framework for inference under covariate shift that avoids explicit modeling of outcome regressions or importance weights. The resulting W-estimator admits a closed-form expression and is numerically equivalent to the classical 1-nearest neighbor estimator, yielding a new optimal transport interpretation of nearest neighbor methods. We establish root-$n$ asymptotic normality and show that the estimator is not asymptotically linear, leading to super-efficiency relative to the semiparametric efficient estimator under covariate shift in certain regimes, and uniformly in missing data problems. Numerical simulations, along with an analysis of a rainfall dataset, underscore the exceptional performance of our W-estimator.


Unsupervised Document and Template Clustering using Multimodal Embeddings

Sampaio, Phillipe R., Maxcici, Helene

arXiv.org Artificial Intelligence

We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.


FLOWR: Flow Matching for Structure-Aware De Novo, Interaction- and Fragment-Based Ligand Generation

Cremer, Julian, Irwin, Ross, Tibo, Alessandro, Janet, Jon Paul, Olsson, Simon, Clevert, Djork-Arné

arXiv.org Artificial Intelligence

We introduce FLOWR, a novel structure-based framework for the generation and optimization of three-dimensional ligands. FLOWR integrates continuous and categorical flow matching with equivariant optimal transport, enhanced by an efficient protein pocket conditioning. Alongside FLOWR, we present SPINDR, a thoroughly curated dataset comprising ligand-pocket co-crystal complexes specifically designed to address existing data quality issues. Empirical evaluations demonstrate that FLOWR surpasses current state-of-the-art diffusion- and flow-based methods in terms of PoseBusters-validity, pose accuracy, and interaction recovery, while offering a significant inference speedup, achieving up to 70-fold faster performance. In addition, we introduce FLOWR:multi, a highly accurate multi-purpose model allowing for the targeted sampling of novel ligands that adhere to predefined interaction profiles and chemical substructures for fragment-based design without the need of re-training or any re-sampling strategies


ROSFD: Robust Online Streaming Fraud Detection with Resilience to Concept Drift in Data Streams

Yelleti, Vivek

arXiv.org Artificial Intelligence

Continuous generation of streaming data from diverse sources, such as online transactions and digital interactions, necessitates timely fraud detection. Traditional batch processing methods often struggle to capture the rapidly evolving patterns of fraudulent activities. This paper highlights the critical importance of processing streaming data for effective fraud detection. To address the inherent challenges of latency, scalability, and concept drift in streaming environments, we propose a robust online streaming fraud detection (ROSFD) framework. Our proposed framework comprises two key stages: (i) Stage One: Offline Model Initialization. In this initial stage, a model is built in offline settings using incremental learning principles to overcome the "cold-start" problem. (ii) Stage Two: Real-time Model Adaptation. In this dynamic stage, drift detection algorithms (viz.,, DDM, EDDM, and ADWIN) are employed to identify concept drift in the incoming data stream and incrementally train the model accordingly. This "train-only-when-required" strategy drastically reduces the number of retrains needed without significantly impacting the area under the receiver operating characteristic curve (AUC). Overall, ROSFD utilizing ADWIN as the drift detection method demonstrated the best performance among the employed methods. In terms of model efficacy, Adaptive Random Forest consistently outperformed other models, achieving the highest AUC in four out of five datasets.


A unified weighting framework for evaluating nearest neighbour classification

Lenz, Oliver Urs, Bollaert, Henri, Cornelis, Chris

arXiv.org Machine Learning

We present the first comprehensive and large-scale evaluation of classical (NN), fuzzy (FNN) and fuzzy rough (FRNN) nearest neighbour classification. We show that existing proposals for nearest neighbour weighting can be standardised in the form of kernel functions, applied to the distance values and/or ranks of the nearest neighbours of a test instance. Furthermore, we identify three commonly used distance functions and four scaling measures. We systematically evaluate these choices on a collection of 85 real-life classification datasets. We find that NN, FNN and FRNN all perform best with Boscovich distance. NN and FRNN perform best with a combination of Samworth rank- and distance weights and scaling by the mean absolute deviation around the median ($r_1$), the standard deviaton ($r_2$) or the interquartile range ($r_{\infty}^*$), while FNN performs best with only Samworth distance-weights and $r_1$- or $r_2$-scaling. We also introduce a new kernel based on fuzzy Yager negation, and show that NN achieves comparable performance with Yager distance-weights, which are simpler to implement than a combination of Samworth distance- and rank-weights. Finally, we demonstrate that FRNN generally outperforms NN, which in turns performs systematically better than FNN.


How many samples are needed to leverage smoothness?

Cabannes, Vivien, Vigogna, Stefano

arXiv.org Machine Learning

A core principle in statistical learning is that smoothness of target functions allows to break the curse of dimensionality. However, learning a smooth function seems to require enough samples close to one another to get meaningful estimate of high-order derivatives, which would be hard in machine learning problems where the ratio between number of data and input dimension is relatively small. By deriving new lower bounds on the generalization error, this paper formalizes such an intuition, before investigating the role of constants and transitory regimes which are usually not depicted beyond classical learning theory statements while they play a dominant role in practice.


Classifying token frequencies using angular Minkowski $p$-distance

Lenz, Oliver Urs, Cornelis, Chris

arXiv.org Artificial Intelligence

Angular Minkowski $p$-distance is a dissimilarity measure that is obtained by replacing Euclidean distance in the definition of cosine dissimilarity with other Minkowski $p$-distances. Cosine dissimilarity is frequently used with datasets containing token frequencies, and angular Minkowski $p$-distance may potentially be an even better choice for certain tasks. In a case study based on the 20-newsgroups dataset, we evaluate clasification performance for classical weighted nearest neighbours, as well as fuzzy rough nearest neighbours. In addition, we analyse the relationship between the hyperparameter $p$, the dimensionality $m$ of the dataset, the number of neighbours $k$, the choice of weights and the choice of classifier. We conclude that it is possible to obtain substantially higher classification performance with angular Minkowski $p$-distance with suitable values for $p$ than with classical cosine dissimilarity.


Online nearest neighbor classification

Dasgupta, Sanjoy, So, Geelon

arXiv.org Artificial Intelligence

We study an instance of online non-parametric classification in the realizable setting. In particular, we consider the classical 1-nearest neighbor algorithm, and show that it achieves sublinear regret - that is, a vanishing mistake rate - against dominated or smoothed adversaries in the realizable setting.


Application of machine learning to gas flaring

Lu, Rong

arXiv.org Artificial Intelligence

Currently in the petroleum industry, operators often flare the produced gas instead of commodifying it. The flaring magnitudes are large in some states, which constitute problems with energy waste and CO2 emissions. In North Dakota, operators are required to estimate and report the volume flared. The questions are, how good is the quality of this reporting, and what insights can be drawn from it? Apart from the company-reported statistics, which are available from the North Dakota Industrial Commission (NDIC), flared volumes can be estimated via satellite remote sensing, serving as an unbiased benchmark. Since interpretation of the Landsat 8 imagery is hindered by artifacts due to glow, the estimated volumes based on the Visible Infrared Imaging Radiometer Suite (VIIRS) are used. Reverse geocoding is performed for comparing and contrasting the NDIC and VIIRS data at different levels, such as county and oilfield. With all the data gathered and preprocessed, Bayesian learning implemented by MCMC methods is performed to address three problems: county level model development, flaring time series analytics, and distribution estimation. First, there is heterogeneity among the different counties, in the associations between the NDIC and VIIRS volumes. In light of such, models are developed for each county by exploiting hierarchical models. Second, the flaring time series, albeit noisy, contains information regarding trends and patterns, which provide some insights into operator approaches. Gaussian processes are found to be effective in many different pattern recognition scenarios. Third, distributional insights are obtained through unsupervised learning. The negative binomial and GMMs are found to effectively describe the oilfield flare count and flared volume distributions, respectively. Finally, a nearest-neighbor-based approach for operator level monitoring and analytics is introduced.


Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier -- A Review

Prasath, V. B. Surya, Alfeilat, Haneen Arafat Abu, Lasassmeh, Omar, Hassanat, Ahmad B. A., Tarawneh, Ahmad S.

arXiv.org Artificial Intelligence

The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested example and the training examples. This raises a major question about which distance measures to be used for the KNN classifier among a large number of distance and similarity measures? This review attempts to answer the previous question through evaluating the performance (measured by accuracy, precision and recall) of the KNN using a large number of distance measures, tested on a number of real world datasets, with and without adding different levels of noise. The experimental results show that the performance of KNN classifier depends significantly on the distance used, the results showed large gaps between the performances of different distances. We found that a recently proposed non-convex distance performed the best when applied on most datasets comparing to the other tested distances. In addition, the performance of the KNN degraded only about $20\%$ while the noise level reaches $90\%$, this is true for all the distances used. This means that the KNN classifier using any of the top $10$ distances tolerate noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise comparing to other distances.